{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Fun with ML Production\n",
    "\n",
    "This is a composition of fun things (tricks) that one may try in ML production environment.\n",
    "\n",
    "For quick demonstration purposes, we will be using the MNIST dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras.datasets import mnist\n",
    "import numpy as np\n",
    "(x_train, y_train), (x_test, y_test) = mnist.load_data()\n",
    "x_train = (x_train / 255.0).astype(np.float32)\n",
    "x_test  = (x_test / 255.0).astype(np.float32)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Model Aggregation (like Ensemble)\n",
    "\n",
    "### Intro\n",
    "\n",
    "Alternate method to train multiple versions of the model in parallel where each has a different weight initialization.\n",
    "\n",
    "In this method, I use model aggregation. The tecnique is similar to an 'ensemble' training, except in ensemble each model is trained independently. In aggregation, the models are conditionally dependent on each other during training.\n",
    "\n",
    "### Method\n",
    "\n",
    "Consider a model that is composed of a stem (entry) group, a learner (features are learned) group and classifier (exit) group --such as in the Idiomatic macro-architecture design pattern for models.\n",
    "\n",
    "                                stem => learner => classifier\n",
    "                                \n",
    "In aggregation, we start with the stem and the add multiple branches, where each branch has a separate copy of the learneer+classifier group of the model, as depicted below:\n",
    "\n",
    "                                       inputs\n",
    "                                        stem\n",
    "\n",
    "                    learner_1        learner_2        learner_3\n",
    "                    classifier_1     classifier_2     classifier_2\n",
    "                    output_1         output_2         output_3\n",
    "                \n",
    "When instantiating the model, we will use a single input with multiple outputs:\n",
    "\n",
    "                    model = Model(inputs, [output_1, output_2, output_3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras import Input, Model, Sequential\n",
    "from tensorflow.keras.layers import Dense, Flatten\n",
    "from tensorflow.keras.optimizers import Adam\n",
    "\n",
    "# Stem\n",
    "inputs = Input((28, 28))\n",
    "x = Flatten()(inputs)\n",
    "\n",
    "# Model 1 branch\n",
    "x1 = Dense(64, activation='relu', name=\"64_1\")(x)  # input is output from the stem\n",
    "x1 = Dense(128, activation='relu', name=\"128_1\")(x1)\n",
    "o1 = Dense(10, activation='softmax', name=\"output1\")(x1)\n",
    "\n",
    "# Model 2 branch\n",
    "x2 = Dense(64, activation='relu', name=\"64_2\")(x)\n",
    "x2 = Dense(128, activation='relu', name=\"128_2\")(x2)\n",
    "o2 = Dense(10, activation='softmax', name=\"output2\")(x2)\n",
    "\n",
    "# Model 3 branch\n",
    "x3 = Dense(64, activation='relu', name=\"64_3\")(x)\n",
    "x3 = Dense(128, activation='relu', name=\"128_3\")(x3)\n",
    "o3 = Dense(10, activation='softmax', name=\"output3\")(x3)\n",
    "\n",
    "model = Model(inputs, [o1, o2, o3])\n",
    "model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(lr=0.001), metrics=['acc'])\n",
    "model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train\n",
    "\n",
    "This now train all three models (branches) in parallel. Note for the label data we pass three copies of the labels --i.e., we need to specify a corresponding label set per classifier output.\n",
    "\n",
    "Let's point out two things:\n",
    "\n",
    "The layers for each branch are trained separate from each other.\n",
    "\n",
    "But, the training of the stem will be an aggregation from the training of the branches. Since each branch has a different weight initialization, the updates on the weights in the stem are like a push and pull effect --thus providing regularization (a tad bit of noise).\n",
    "\n",
    "Likewise, that noise is reflected back down the branches on the next forward feed batch --providing regularization through all layers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(x_train, [y_train, y_train, y_train], epochs=10, batch_size=32, validation_split=0.1, verbose=1)\n",
    "model.evaluate(x_test, [y_test, y_test, y_test])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Cutout the Best Model\n",
    "\n",
    "Next, we use the evaluation results to pick which branch got the highest accuracy (1, 2 or 3). Then we construct a new model using the trained layers of the corresponding branch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Layer paths for the three models\n",
    "models = { '1' : [5, 8],\n",
    "           '2' : [6, 9],\n",
    "           '3' : [7, 10] }\n",
    "\n",
    "\n",
    "# common stem\n",
    "x = model.input\n",
    "x = model.layers[1](x)\n",
    "x = model.layers[2](x)\n",
    "\n",
    "# branch layers\n",
    "for n in models['3']:\n",
    "    x = model.layers[n](x)\n",
    "\n",
    "new_model = Model(inputs, x)\n",
    "new_model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(lr=0.001), metrics=['acc'])\n",
    "new_model.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's now feed the test data again to the new 'cutout' model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "n_model.evaluate(x_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Add another layer to trained model\n",
    "\n",
    "Let's do this again with training an aggregation of models. This time we will make the branched models one layer deeper, but we will reuse all (but the classifier) trained layers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "inputs = Input((28, 28))\n",
    "x = model.layers[1](inputs)\n",
    "x = model.layers[2](x)\n",
    "\n",
    "x1 = model.layers[5](x)\n",
    "x1 = Dense(128, activation='relu')(x1)\n",
    "o1 = Dense(10, activation='softmax')(x1)\n",
    "\n",
    "\n",
    "x2 = model.layers[5](x)\n",
    "x2 = Dense(128, activation='relu')(x2)\n",
    "o2 = Dense(10, activation='softmax')(x2)\n",
    "\n",
    "\n",
    "x3 = model.layers[5](x)\n",
    "x3 = Dense(128, activation='relu')(x3)\n",
    "o3 = Dense(10, activation='softmax')(x3)\n",
    "\n",
    "model2 = Model(inputs, [o1, o2, o3])\n",
    "model2.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(lr=0.001), metrics=['acc'])\n",
    "model2.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's train it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model2.fit(x_train, [y_train, y_train, y_train], epochs=5, validation_split=0.1, verbose=1)\n",
    "model2.evaluate(x_test, [y_test, y_test, y_test])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## (Why Not to Use) Auxiliary Classifier\n",
    "\n",
    "Inception (2014) introduced the concept of auxiliary classifiers.\n",
    "\n",
    "The idea, is that the further away the classifier is from the entry (bottom), the less it will contribute to updating the those weights on each pass. The concept was to add separate output classifiers at earlier parts in the model (auxiliary) which make contributions to the entry layers --from a less distant point.\n",
    "\n",
    "In theory, the idea 'intuitively' made sense. But we don't see auxiliary classifiers subsequent to inception --there is a reason.\n",
    "\n",
    "### Method\n",
    "\n",
    "So, let's propose using this technique and see what happens. We will make our model be 3 dense layers, and after each dense layer we will add an auxiliary classifier. That is, each level will have it's own classifier.\n",
    "\n",
    "What we 'intuitively' expect is that each deeper level is more accurate than the last. Given that, we can pick a level of accuracy and cut off the remaining layers for a more compact model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras import Input, Model, Sequential\n",
    "from tensorflow.keras.layers import Dense, Flatten\n",
    "from tensorflow.keras.optimizers import Adam\n",
    "\n",
    "# Stem\n",
    "inputs = Input((28, 28))\n",
    "x = Flatten()(inputs)\n",
    "\n",
    "# Level 1 and Auxiliary Classifier\n",
    "x = Dense(128, activation='relu')(x)\n",
    "o1 = Dense(10, activation='softmax', name=\"level1\")(x)\n",
    "\n",
    "# Level 2 and Auxiliary Classifier\n",
    "x = Dense(128, activation='relu')(x)\n",
    "o2 = Dense(10, activation='softmax', name=\"level2\")(x)\n",
    "\n",
    "# Level 3 and Final Classifier\n",
    "x = Dense(128, activation='relu')(x)\n",
    "o3 = Dense(10, activation='softmax', name=\"level3\")(x)\n",
    "\n",
    "model3 = Model(inputs, [o1, o2, o3])\n",
    "model3.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(lr=0.001), metrics=['acc'])\n",
    "model3.summary()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's train it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model3.fit(x_train, [y_train, y_train, y_train], epochs=5, validation_split=0.1, verbose=1)\n",
    "model3.evaluate(x_test, [y_test, y_test, y_test])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Results\n",
    "\n",
    "Note that as we went deeper, the accuracy did not go up -- but actually went down!\n",
    "\n",
    "Why? Each level is a separate 'solver' and in effect they are fighting each other. The more deeper, the more solvers fighting each other, the further the degradation in results!\n",
    "\n",
    "### Next Try\n",
    "This seems like it maybe a case of covariant shift? Perhaps we can solve this by adding in BatchNormalization() as a pre-activation before each auxiliary/final classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from tensorflow.keras import Input, Model, Sequential\n",
    "from tensorflow.keras.layers import Dense, Flatten, BatchNormalization\n",
    "from tensorflow.keras.optimizers import Adam\n",
    "\n",
    "# Stem\n",
    "inputs = Input((28, 28))\n",
    "x = Flatten()(inputs)\n",
    "\n",
    "# Level 1 with pre-activation normalization\n",
    "x = Dense(128, activation='relu')(x)\n",
    "o1 = BatchNormalization()(x)\n",
    "o1 = Dense(10, activation='softmax', name=\"level1\")(o1)\n",
    "\n",
    "# Level 2 with pre-activation normalization\n",
    "x = Dense(128, activation='relu')(x)\n",
    "o2 = BatchNormalization()(x)\n",
    "o2 = Dense(10, activation='softmax', name=\"level2\")(o2)\n",
    "\n",
    "# Level 3 with pre-activation normalization\n",
    "x = Dense(128, activation='relu')(x)\n",
    "o3 = BatchNormalization()(x)\n",
    "o3 = Dense(10, activation='softmax', name=\"level3\")(o3)\n",
    "\n",
    "model4 = Model(inputs, [o1, o2, o3])\n",
    "model4.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(lr=0.001), metrics=['acc'])\n",
    "model4.summary()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model4.fit(x_train, [y_train, y_train, y_train], epochs=5, validation_split=0.1, verbose=1)\n",
    "model4.evaluate(x_test, [y_test, y_test, y_test])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Result\n",
    "\n",
    "Nope, it doesn't help!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model5.fit(x_train, y_train, epochs=5, validation_split=0.1, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## End"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
